Introduction

略

Matrix Multiplication

利用矩阵乘法介绍优化方案

采用 Python、Java、C 的运行时间不一样

Why is Python so slow and C so fast?

Python is interpreted.
C is compiled directly to machine code.
Java is compiled to byte-code, which is then interpreted and just-in-time (JIT) compiled to machine code.

i、j、k 循环调换位置后运行时间不一样

cache hits 和 cache misses

Clang 优化

不一定 O3 优化比 O2 优化快，有时候 O2 快，有时候 O3 快

Parallel Loops

The cilk_for loop allows all iterations of the loop to execute in parallel.

Rule of Thumb Parallelize outer loops rather than inner loops.

进一步优化-重用数据（tiling）

Restructure the computation to reuse data in the cache as much as possible. （Cache misses are slow, and cache hits are fast.）
Try to make the most of the cache by reusing the data that’s already there.

Parallel divide-and-conquer

cilk_spawn
/*
The child function call is spawned, meaning it may execute in parallel with the parent caller.
*/
cilk_sync
/*
Control may not pass this point until all spawned children have returned.
*/

Compiler Vectorization（编译矢量化）

Many machines don’t support the newest set of vector instructions, however, so the compiler uses vector instructions conservatively by default

mit-6-172-1